Distributional learning theory, or the learning of probability distributions, is a framework in computational learning theory. It was proposed by Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994〔M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, L. Sellie ''On the Learnability of Discrete Distributions''. ACM Symposium on Theory of Computing, 1994 ()〕 and was inspired by the PAC framework introduced by Leslie Valiant.〔(L. Valiant ''A theory of the learnable''. Communications of ACM, 1984 )〕 In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples were drawn. Because of its generality, this framework has been used in a large variety of fields such as machine learning, approximation algorithms, applied probability and statistics. This article explains the basic definitions, tools and results of the framework from the theory-of-computation point of view.

==Basic Definitions==

Let X be the support of the distributions of interest. As in the original work of Kearns et al.,〔 if X is finite it can be assumed without loss of generality that X = {0,1}^n, where n is the number of bits that have to be used in order to represent any y ∈ X. The focus is on probability distributions over X. There are two possible representations of a probability distribution D over X.

* '''probability distribution function (or evaluator)''': an evaluator EVAL(D) for D takes as input any y ∈ X and outputs a real number EVAL(D)(y) which denotes the probability of y according to D, i.e. EVAL(D)(y) = Pr[Y = y] if Y ~ D.
* '''generator''': a generator GEN(D) for D takes as input a string of truly random bits and outputs a sample y ∈ X distributed according to D. The generator can be interpreted as a routine that simulates sampling from the distribution D given a sequence of fair coin tosses.
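The two representations can be made concrete with a small sketch. As an illustrative (not canonical) example, take the product distribution over {0,1}^n in which bit i is 1 independently with probability p[i]; the function and variable names below are assumptions of this sketch, not notation from the literature.

```python
import random

def evaluator(y, p):
    """EVAL(D): return Pr[Y = y] for the bit string y under bias vector p."""
    prob = 1.0
    for bit, pi in zip(y, p):
        prob *= pi if bit == 1 else 1.0 - pi
    return prob

def generator(p, rng=random.random):
    """GEN(D): draw one sample, consuming one fair "coin toss" per bit."""
    return tuple(1 if rng() < pi else 0 for pi in p)

p = [0.5, 0.25]
print(evaluator((1, 0), p))  # 0.5 * 0.75 = 0.375
# Sanity check: the evaluator's probabilities over the support sum to 1.
print(sum(evaluator((a, b), p) for a in (0, 1) for b in (0, 1)))
```

Both representations are polynomial here: the evaluator and the generator each run in time linear in n, which is what "polynomial generator/evaluator" requires in the next paragraph.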
A distribution D is said to have a polynomial generator (respectively evaluator) if its generator (respectively evaluator) exists and can be computed in polynomial time.

Let C_X be a class of distributions over X, that is, C_X is a set such that every D ∈ C_X is a probability distribution with support X. C_X can also be written as C for simplicity.

Before defining learnability it is necessary to define what a good approximation of a distribution D is. There are several ways to measure the distance between two distributions; the three most common are
* Kullback-Leibler divergence
* Total variation distance
* Kolmogorov distance

The strongest of these distances is the Kullback-Leibler divergence and the weakest is the Kolmogorov distance. This means that for any pair of distributions D, D′:

: Kol(D, D′) ≤ TV(D, D′) ≤ √(KL(D ∥ D′)/2),

where the second inequality is Pinsker's inequality. Therefore, for example, if D and D′ are close with respect to Kullback-Leibler divergence then they are also close with respect to all the other distances. The definitions that follow hold for all of these distances, so the symbol d(D, D′) denotes the distance between D and D′ using any one of them. Although learnability of a class of distributions can be defined using any of these distances, applications usually refer to a specific one.

The basic input used to learn a distribution is a number of samples drawn from this distribution. From the computational point of view, the assumption is that such a sample is given in a constant amount of time, i.e. the algorithm has access to an oracle GEN(D) that returns a sample from the distribution D. Sometimes the interest is, apart from the time complexity, in the number of samples that have to be used in order to learn a specific distribution D in a class of distributions C. This quantity is called the sample complexity of the learning algorithm.
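The three distances can be computed directly when the two distributions are given as probability vectors over a common finite, ordered support. A minimal sketch (function names are illustrative):

```python
import math

def kl_divergence(d1, d2):
    """Kullback-Leibler divergence KL(d1 || d2) in nats."""
    return sum(p * math.log(p / q) for p, q in zip(d1, d2) if p > 0)

def total_variation(d1, d2):
    """Total variation distance: half the L1 distance between the vectors."""
    return 0.5 * sum(abs(p - q) for p, q in zip(d1, d2))

def kolmogorov(d1, d2):
    """Kolmogorov distance: maximum gap between the two CDFs."""
    c1 = c2 = gap = 0.0
    for p, q in zip(d1, d2):
        c1 += p
        c2 += q
        gap = max(gap, abs(c1 - c2))
    return gap

d1 = [0.5, 0.3, 0.2]
d2 = [0.4, 0.4, 0.2]
# Kolmogorov <= TV, and TV <= sqrt(KL/2) by Pinsker's inequality.
assert kolmogorov(d1, d2) <= total_variation(d1, d2)
assert total_variation(d1, d2) <= math.sqrt(kl_divergence(d1, d2) / 2)
```

The asserts check the strong-to-weak ordering on a concrete pair: closeness in KL forces closeness in total variation, which forces closeness in Kolmogorov distance, but not conversely.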
To make the problem of distribution learning clearer, consider the problem of supervised learning as defined in.〔Lorenzo Rosasco, Tomaso Poggio, "A Regularization Tour of Machine Learning — MIT-9.520 Lectures Notes" Manuscript, Dec. 2014 ()〕 In this framework of statistical learning theory a training set S = {(x₁, y₁), …, (xₘ, yₘ)} is given, and the goal is to find a target function f that minimizes some loss function, e.g. the square loss function. More formally f = arg min_g E[ℓ(g(x), y)], where ℓ is the loss function, e.g. ℓ(z, y) = (z − y)², and D(x, y) is the probability distribution according to which the elements of the training set are sampled. If the conditional probability distribution D_x(y) is known, then the target function has the closed form f(x) = E_{y ~ D_x}[y]. So the set S is a set of samples from the probability distribution D(x, y). The goal of distributional learning theory is to find D given S, which can then be used to find the target function f.

'''Definition of learnability'''

''A class of distributions C is called efficiently learnable if for every ε > 0 and 0 < δ ≤ 1, given access to GEN(D) for an unknown distribution D ∈ C, there exists a polynomial time algorithm A, called the learning algorithm of C, that outputs a generator or an evaluator of a distribution D′ such that''

: Pr[ d(D, D′) ≤ ε ] ≥ 1 − δ.

''If we know that D′ ∈ C then A is called a proper learning algorithm, otherwise A is called an improper learning algorithm.''

In some settings the class of distributions C is a class of well-known distributions that can be described by a set of parameters. For instance, C could be the class of all Gaussian distributions N(μ, σ²). In this case the algorithm A should be able to estimate the parameters μ, σ, and A is then called a parameter learning algorithm. Parameter learning for simple distributions is a very well studied field, known as statistical estimation, and there is a very long bibliography on estimators for different kinds of simple, known distributions. Distributional learning theory, by contrast, deals with learning classes of distributions that have a more complicated description.
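The Gaussian case above admits a trivial parameter learning algorithm: estimate μ and σ by the sample mean and sample standard deviation. The sketch below simulates the oracle GEN(D) with a seeded pseudorandom generator; the function names, seed and sample size are assumptions of this illustration, and with enough samples the estimates land within ε of the true parameters with high probability.

```python
import random
import statistics

def learn_gaussian(gen, m):
    """Draw m samples from the oracle gen and return estimates (mu_hat, sigma_hat)."""
    samples = [gen() for _ in range(m)]
    return statistics.fmean(samples), statistics.stdev(samples)

rng = random.Random(0)               # fixed seed so the sketch is reproducible
gen = lambda: rng.gauss(3.0, 2.0)    # unknown D = N(3, 4) hidden behind GEN(D)
mu_hat, sigma_hat = learn_gaussian(gen, 10_000)
print(mu_hat, sigma_hat)  # close to the true parameters 3.0 and 2.0
```

This is a proper learning algorithm: the output N(mu_hat, sigma_hat²) belongs to the class C of Gaussians, and the sample complexity m controls how small ε can be made for a given confidence 1 − δ.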